Matthew Talluto
14.12.2023
Today we will look at three useful functions for manipulating data frames in more complicated/interesting cases.
trees = read.csv("../datasets/tree_abundance.csv")
head(trees)
## pid year Abies.balsamifera Acer.saccharum Betula.papyrifera
## 1 1 1980 4 0 0
## 2 2 2006 40 0 1
## 3 3 2006 36 0 9
## 4 4 1980 3 0 0
## 5 5 1980 4 0 0
## 6 6 1980 16 0 0
## Populus.tremuloides Tsuga.canadensis
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
One variable, abundance, is spread across five columns. Another variable, species, is not encoded in a column at all! Rather, the information in this variable is encoded in the column names.
An R package, reshape2, can help us convert between wide and tall data frames with two functions: melt and dcast.
# install.packages("reshape2") # run this once, to install the package
library("reshape2")
trees_tall = melt(trees, id.vars = c("pid", "year"), variable.name = "species", value.name = "abundance")
head(trees_tall)
## pid year species abundance
## 1 1 1980 Abies.balsamifera 4
## 2 2 2006 Abies.balsamifera 40
## 3 3 2006 Abies.balsamifera 36
## 4 4 1980 Abies.balsamifera 3
## 5 5 1980 Abies.balsamifera 4
## 6 6 1980 Abies.balsamifera 16
dcast performs the reverse operation, converting a tall data frame back to a wide one:
trees_wide = dcast(trees_tall, pid ~ species + year, value.var = "abundance", fill = 0)
trees_wide[1:10, 1:10]
## pid Abies.balsamifera_1975 Abies.balsamifera_1980 Abies.balsamifera_1985
## 1 1 0 4 0
## 2 2 0 0 0
## 3 3 0 0 0
## 4 4 0 3 0
## 5 5 0 4 0
## 6 6 0 16 0
## 7 7 0 0 0
## 8 8 0 4 0
## 9 9 0 30 0
## 10 10 0 0 0
## Abies.balsamifera_1988 Abies.balsamifera_1989 Abies.balsamifera_1990
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 0
## Abies.balsamifera_1993 Abies.balsamifera_1994 Abies.balsamifera_1995
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 0
subset gives you a new data frame that is a subset of the old one:
balsam_fir_1980 = subset(trees_tall, species == "Abies.balsamifera" & year == 1980)
head(balsam_fir_1980)
## pid year species abundance
## 1 1 1980 Abies.balsamifera 4
## 4 4 1980 Abies.balsamifera 3
## 5 5 1980 Abies.balsamifera 4
## 6 6 1980 Abies.balsamifera 16
## 7 7 1980 Abies.balsamifera 0
## 8 8 1980 Abies.balsamifera 4
hist(balsam_fir_1980$abundance)
We can also use aggregate to compute all kinds of summaries by group.
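For example, aggregate can compute the mean abundance of each species in each year in a single call. A minimal sketch, using a toy tall data frame with the same columns as trees_tall above (the values here are made up for illustration):

```r
# Toy tall data frame shaped like trees_tall above
toy = data.frame(species = rep(c("Abies.balsamifera", "Betula.papyrifera"), each = 4),
                 year = rep(c(1980, 2006), times = 4),
                 abundance = c(4, 40, 36, 3, 0, 1, 9, 0))

# Mean abundance for each species in each year, returned as a data frame
aggregate(abundance ~ species + year, data = toy, FUN = mean)

# tapply computes the same summary, but returns a species-by-year matrix
tapply(toy$abundance, list(toy$species, toy$year), mean)
```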
tapply works similarly, returning an array rather than a data frame.
If two variables \(x\) and \(y\) are independent, then knowing \(x\) tells us nothing about \(y\):
\[ pr(y|x) = pr(y) \]
\[ \mathrm{cov}_{xy} = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \]
\[ \begin{aligned} \mathrm{cov}_{xy} & = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \\ r_{xy} & = \frac{\mathrm{cov}_{xy}}{s_x s_y} \end{aligned} \]
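These definitions can be checked numerically against R's built-in functions; a sketch with simulated data:

```r
set.seed(42)
x = rnorm(30)
y = 0.5 * x + rnorm(30)
n = length(x)

# Covariance from the formula above, versus the built-in cov()
cov_xy = sum((x - mean(x)) * (y - mean(y))) / (n - 1)
c(cov_xy, cov(x, y))

# Pearson correlation: the covariance scaled by both standard deviations
r_xy = cov_xy / (sd(x) * sd(y))
c(r_xy, cor(x, y))
```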
\(H_0\): \(r = 0\)
\(H_A\): two sided (\(\rho \ne 0\)) or one-sided (\(\rho > 0\) or \(\rho < 0\))
\(r\) has a standard error:
\[ s_{r} = \sqrt{\frac{1-r^2}{n-2}} \] We can then compute a \(t\)-statistic:
\[ t = \frac{r}{s_r} \]
The p-value is the probability of observing a \(t\)-statistic at least this extreme under \(H_0\) (computed from the CDF of the \(t\) distribution); we reject \(H_0\) when the p-value is below \(\alpha\).
\[ \begin{aligned} s_{r} &= \sqrt{\frac{1-r^2}{n-2}} \\ t &= \frac{r}{s_r} \end{aligned} \]
data(penguins, package = "palmerpenguins")
penguins = as.data.frame(penguins)
penguins = penguins[complete.cases(penguins),] # remove NAs
penguins_ade = subset(penguins, species == "Adelie")
n = nrow(penguins_ade)
(r = cor(penguins_ade$body_mass_g, penguins_ade$flipper_length_mm))
## [1] 0.465
(s_r = sqrt((1-r^2)/(n-2)))
## [1] 0.0738
(t_val = r/s_r)
## [1] 6.3
(2 * pt(t_val, n-2, lower.tail = FALSE)) ## two-sided test
## [1] 3.4e-09
# Equivalent built-in test
with(penguins_ade, cor.test(body_mass_g, flipper_length_mm, alternative = "two.sided"))
##
## Pearson's product-moment correlation
##
## data: body_mass_g and flipper_length_mm
## t = 6, df = 144, p-value = 3e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.327 0.583
## sample estimates:
## cor
## 0.465
Tests of association include correlation (cor.test and cor), tests of proportions (prop.test), or the \(\chi^2\) test (chisq.test).
Heterogeneity of subgroups can produce misleading correlations:
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -1, df = 148, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2447 0.0734
## sample estimates:
## cor
## -0.0879
cor.test(x, y, method = 'spearman')
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 8e+05, p-value = 9e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.357
Observation:
| colour | F2 |
|---|---|
| violet | 705 |
| white | 224 |
\(H_0\): Inheritance is Mendelian (violet:white = 3:1)
\(H_A\): Inheritance is not exactly Mendelian
# Test of proportions against a null hypothesis
# For small sample sizes, use binom.test
counts = matrix(c(705, 224), ncol = 2)
prop.test(counts, p = 0.75, alternative = "two.sided")
##
## 1-sample proportions test with continuity correction
##
## data: counts, null probability 0.75
## X-squared = 0.3, df = 1, p-value = 0.6
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
## 0.730 0.786
## sample estimates:
## p
## 0.759
woodpecker = read.csv("../datasets/woodpecker.csv")
head(woodpecker)
## forest_type nest_tree
## 1 burned birch
## 2 burned birch
## 3 burned jack pine
## 4 burned aspen
## 5 burned birch
## 6 burned jack pine
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23
We want to test for an association between the two variables (forest type and nest tree).
\(H_0\): Nesting tree is not associated with forest type
\(H_A\): Nest tree is associated with forest type
table(woodpecker)/rowSums(table(woodpecker))
## nest_tree
## forest_type aspen birch jack pine
## burned 0.2500 0.6667 0.0833
## unburned 0.4143 0.2571 0.3286
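These hypotheses can be tested with chisq.test. A sketch, entering the counts from the table above directly so the block is self-contained (equivalently, chisq.test(table(woodpecker)) runs the same test on the raw data frame):

```r
# Counts from the woodpecker table above, as a 2 x 3 contingency table
woodp_tab = matrix(c(6, 29, 16, 18, 2, 23), nrow = 2,
    dimnames = list(forest_type = c("burned", "unburned"),
                    nest_tree = c("aspen", "birch", "jack pine")))

# Chi-squared test of independence between forest type and nest tree
chisq.test(woodp_tab)
```

A small p-value leads us to reject \(H_0\) and conclude that nest tree is associated with forest type.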
You see this a lot. When should you do it? NEVER
This is almost as bad, and sadly much more common
Barplots, or proportional bars for counts within categories
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23
library(ggplot2)
woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
fill = forest_type)) + theme_minimal()
woodp_plot = woodp_plot + geom_bar(width = 0.5)
woodp_plot
Stacked bars are “unfair”: it is easiest to compare the “rooted” class (unburned).
woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5,
position=position_dodge())
woodp_plot = woodp_plot + xlab("Nest Tree Type") +
theme_minimal() + labs(fill = "Forest Type")
woodp_plot
Side-by-side bars allow us to compare all categories on equal footing.
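For truly proportional bars (each bar rescaled to sum to 1), one option not shown in the code above is ggplot's position_fill. A sketch, rebuilding the observations from the counts in the table above:

```r
library(ggplot2)

# Rebuild the woodpecker observations from the counts in the table above
woodpecker = data.frame(
  forest_type = rep(rep(c("burned", "unburned"), each = 3), c(6, 16, 2, 29, 18, 23)),
  nest_tree = rep(rep(c("aspen", "birch", "jack pine"), 2), c(6, 16, 2, 29, 18, 23)))

# position_fill() converts counts into within-bar proportions
ggplot(woodpecker, aes(x = nest_tree, fill = forest_type)) +
  geom_bar(position = position_fill(), width = 0.5) +
  ylab("Proportion") + theme_minimal()
```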
Scatterplots become less useful.
birddiv = read.csv("../datasets/birddiv.csv")
bird_plot = ggplot(birddiv, aes(x=forest_frag,
y = richness, colour = bird_type)) +
geom_point() + theme_minimal()
head(birddiv)
## Grow.degd For.cover NDVI bird_type richness forest_frag
## 1 330 99.9 60.38 forest 8 1
## 2 330 0.0 22.88 forest 1 0
## 3 330 38.3 11.86 forest 5 3
## 4 330 11.4 19.07 forest 7 7
## 5 330 0.0 2.12 forest 2 0
## 6 170 100.0 54.03 forest 7 1
Adding jitter can sometimes improve things.
Better solution: boxplots
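A sketch of the boxplot idea, using simulated data with the same shape as birddiv (the values here are hypothetical; forest_frag is treated as a discrete grouping variable):

```r
library(ggplot2)

# Simulated data shaped like birddiv above
set.seed(1)
birddiv_sim = data.frame(forest_frag = sample(0:7, 120, replace = TRUE),
                         richness = rpois(120, lambda = 5))

# One box per fragmentation level shows the distribution of richness
ggplot(birddiv_sim, aes(x = factor(forest_frag), y = richness)) +
  geom_boxplot() + xlab("forest_frag") + theme_minimal()
```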
Correlation assumes nothing about the causal nature of the relationship between x and y
Often, we have a hypothesis that a variable is in some way functionally dependent on another
A simple linear regression describes/tests the relationship between a response variable \(y\) and a predictor variable \(x\):
\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \\ \\ \end{aligned} \]
\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:
\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i\\ \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
\[ \begin{aligned} \hat{y_i} & = a + b x_i \\ y_i & = \hat{y_i} + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
We want to solve these equations for \(a\) and \(b\)
The “best” \(a\) and \(b\) are the ones that draw a line that is closest to the most points (i.e., that minimizes \(s_{\epsilon}\))
\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]
\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]
Solving for the minimum ESS yields:
\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]
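We can verify these least-squares solutions against lm directly; a sketch with simulated data:

```r
set.seed(10)
x = 1:20
y = 2 + 0.5 * x + rnorm(20, sd = 0.3)

# Slope and intercept from the formulas above
b = cov(x, y) / var(x)
a = mean(y) - b * mean(x)

# Both should match lm()'s coefficients exactly
c(a, b)
coef(lm(y ~ x))
```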
The parameters have standard errors:
\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum{(x-\bar{x}}^2)}} \]
\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum{(x-\bar{x}}^2)}} \]
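These standard errors can likewise be checked against the "Std. Error" column reported by summary(lm(...)); a sketch with simulated data:

```r
set.seed(2)
x = 1:15
y = 1 + 0.8 * x + rnorm(15, sd = 0.5)
mod = lm(y ~ x)
n = length(x)

# Residual standard error, computed from the residuals
s_yhat = sqrt(sum(residuals(mod)^2) / (n - 2))

# Standard errors of the intercept (a) and slope (b) from the formulas above
s_a = s_yhat * sqrt(1 / n + mean(x)^2 / sum((x - mean(x))^2))
s_b = s_yhat / sqrt(sum((x - mean(x))^2))

c(s_a, s_b)
summary(mod)$coefficients[, "Std. Error"]
```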
\(\mathbf{H_0}\): The slope \(\beta\) = 0 (i.e., no variation in \(y\) is explained by variation in \(x\))
The ratio of variance explained to residual variance follows an \(F\) distribution
\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2 \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]
The coefficient of determination tells you the proportion of variance explained by the model:
\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
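This quantity matches the "Multiple R-squared" value in the lm summary; a quick numerical check (a sketch with simulated data):

```r
set.seed(3)
x = 1:12
y = 3 + 0.4 * x + rnorm(12, sd = 0.4)
mod = lm(y ~ x)

# r^2 from the definition: explained variation over total variation
r2 = sum((fitted(mod) - mean(y))^2) / sum((y - mean(y))^2)

c(r2, summary(mod)$r.squared)
```

In simple regression, \(r^2\) also equals the squared Pearson correlation, cor(x, y)^2.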
## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17),
wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2,
3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
##
## Call:
## lm(formula = wing_length ~ age, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3070 -0.2154 0.0655 0.1632 0.2251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7131 0.1479 4.82 0.00053 ***
## age 0.2702 0.0135 20.03 5.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.218 on 11 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.971
## F-statistic: 401 on 1 and 11 DF, p-value: 5.27e-10
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 0.388 1.04
## age 0.241 0.30
par(mfrow=c(1, 2), bty='n')
## scale(residuals) produces **standardized** residuals
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)
We found a significant positive relationship between wing length and
age (\(F_{1,11} = 400\), \(p < 0.001\), \(R^2 = 0.97\); Table 1).
| | Estimate | St. error | 95% CI |
|---|---|---|---|
| Intercept | 0.71 | 0.15 | [0.39, 1.03] |
| Age | 0.27 | 0.013 | [0.24, 0.30] |